Weakly Supervised Temporal Adjacent Network for Language Grounding

Abstract

Temporal language grounding (TLG) is a fundamental and challenging problem for vision and language understanding. Existing methods mainly focus on the fully supervised setting, which trains with temporal boundary labels and therefore incurs a high annotation cost. In this work, we are dedicated to weakly supervised TLG, where multiple description sentences are given for an untrimmed video without temporal boundary labels. In this task, it is critical to learn a strong cross-modal semantic alignment between sentence semantics and visual content. To this end, we introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding. Specifically, WSTAN learns cross-modal semantic alignment by exploiting a temporal adjacent network in a multiple instance learning (MIL) paradigm, with a whole description paragraph as input. Moreover, we integrate a complementary branch into the framework, which explicitly refines the predictions with pseudo supervision from the MIL stage. An additional self-discriminating loss is devised on both the MIL branch and the complementary branch, aiming to enhance semantic discrimination by self-supervising. Extensive experiments are conducted on three widely used benchmark datasets, i.e., ActivityNet-Captions, Charades-STA, and DiDeMo, and the results demonstrate the effectiveness of our approach.
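The MIL idea mentioned in the abstract can be illustrated with a minimal, generic sketch. This is not the WSTAN implementation: the function names, the top-k mean aggregation, and the binary cross-entropy objective are all assumptions chosen for illustration. The sketch treats the candidate moments of a video as a bag, aggregates their scores into one video-level matching score, and trains so that a matched video-sentence pair scores higher than a mismatched one. The highest-scoring proposal can then serve as a pseudo label for a refinement stage.

```python
import math

def video_level_score(proposal_scores, k=3):
    """Aggregate per-proposal scores (each in [0, 1]) into one bag-level
    score by averaging the top-k. Assumes a non-empty list. (Hypothetical
    aggregation rule, not taken from the paper.)"""
    top = sorted(proposal_scores, reverse=True)[:k]
    return sum(top) / len(top)

def mil_bce_loss(pos_scores, neg_scores, k=3, eps=1e-7):
    """Binary cross-entropy on bag-level scores: the matched pair is
    pushed toward 1, the mismatched pair toward 0."""
    p_pos = min(max(video_level_score(pos_scores, k), eps), 1 - eps)
    p_neg = min(max(video_level_score(neg_scores, k), eps), 1 - eps)
    return -(math.log(p_pos) + math.log(1 - p_neg))

def pseudo_label(proposal_scores):
    """Index of the highest-scoring proposal, usable as pseudo supervision."""
    return max(range(len(proposal_scores)), key=proposal_scores.__getitem__)

# Toy scores for four candidate moments (illustrative values only).
matched = [0.1, 0.9, 0.8, 0.2]     # video paired with its own sentence
mismatched = [0.2, 0.1, 0.3, 0.1]  # video paired with an unrelated sentence

loss = mil_bce_loss(matched, mismatched)
best = pseudo_label(matched)
```

No temporal boundary is used anywhere in the loss; only the video-level pairing supervises the proposal scores, which is what makes the setting weakly supervised.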

Similar resources

Knowledge Aided Consistency for Weakly Supervised Phrase Grounding

Given a natural language query, a phrase grounding system aims to localize mentioned objects in an image. In the weakly supervised scenario, the mapping between image regions (i.e., proposals) and language is not available in the training set. Previous methods address this deficiency by training a grounding system via learning to reconstruct language information contained in input queries from predicte...

Mutually exclusive grounding for weakly supervised non-negative matrix factorisation

Non-negative Matrix Factorisation (NMF) has been successfully applied for learning the meaning of a small set of vocal commands without any prior knowledge of the language. This kind of learning is useful if flexibility in terms of the acoustic and language model is required, for example in assistive technologies for dysarthric speakers because they do not comply with common models. Vocal comma...

Weakly Supervised Action Localization by Sparse Temporal Pooling Network

We propose a weakly supervised temporal action localization algorithm on untrimmed videos using convolutional neural networks. Our algorithm learns from video-level class labels and predicts temporal intervals of human actions with no requirement of temporal localization annotations. We design our network to identify a sparse subset of key segments associated with target actions in a video usin...

Connectionist Temporal Modeling for Weakly Supervised Action Labeling

We propose a weakly-supervised framework for action labeling in video, where only the order of occurring actions is required during training time. The key challenge is that the per-frame alignments between the input (video) and label (action) sequences are unknown during training. We address this by introducing the Extended Connectionist Temporal Classification (ECTC) framework to efficiently e...

Learning Representations for Weakly Supervised Natural Language Processing Tasks

Finding the right representations for words is critical for building accurate NLP systems when domain-specific labeled data for the task is scarce. This article investigates novel techniques for extracting features from n-gram models, Hidden Markov Models, and other statistical language models, including a novel Partial Lattice Markov Random Field model. Experiments on part-of-speech tagging and...

Journal

Journal title: IEEE Transactions on Multimedia

Year: 2022

ISSN: 1520-9210, 1941-0077

DOI: https://doi.org/10.1109/tmm.2021.3096087